-
Sergiy Migdalskiy - Performance Optimization, SIMD and Cache.
-
"A rehash of Sergiy Migdalskiy GDC 2015 talk: Performance Optimization for Physics".
-
He's from Valve; he worked on Left 4 Dead.
-
Excellent video!! Great explanation.
-
The talk covers minimizing branches, a different approach to pointers that lets data structures be stored to disk, SoA layouts, and SIMD.
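-
To make the branch-minimization point concrete, here is a minimal sketch (my own illustration, not code from the talk): replacing data-dependent branches with select-style operations lets the compiler typically emit SIMD min/max instead of conditional jumps.
```cpp
#include <cstddef>

// Branchy version: the compiler may emit a conditional jump per element,
// which the branch predictor can mispredict on irregular data.
void clamp_branchy(float* v, std::size_t n, float lo, float hi) {
    for (std::size_t i = 0; i < n; ++i) {
        if (v[i] < lo)      v[i] = lo;
        else if (v[i] > hi) v[i] = hi;
    }
}

// Branchless version: min/max-style selects typically map directly to
// SIMD instructions (e.g., minps/maxps) and keep the pipeline full.
void clamp_branchless(float* v, std::size_t n, float lo, float hi) {
    for (std::size_t i = 0; i < n; ++i) {
        float x = v[i];
        x = x < lo ? lo : x;   // typically compiles to a select/min, not a jump
        x = x > hi ? hi : x;   // typically compiles to a select/max, not a jump
        v[i] = x;
    }
}
```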
-
{36:10} SIMD.
-
-
Data items should be independent of each other, just as any parallel operation requires.
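-
A small sketch of why independence matters for layout (my example, not the talk's): Structure-of-Arrays keeps each field contiguous, so consecutive, independent elements map directly onto SIMD lanes and cache lines.
```cpp
#include <cstddef>
#include <vector>

// Array-of-Structures (AoS): each particle's fields are interleaved,
// so touching only the x components strides through memory.
struct ParticleAoS { float x, y, z, mass; };

// Structure-of-Arrays (SoA): each field is contiguous and independent,
// which is what SIMD lanes and the cache prefetcher want.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// With SoA, integrating positions touches only the arrays it needs,
// and each element is independent of its neighbors: vectorizable.
void integrate_x(ParticlesSoA& p, const std::vector<float>& vx, float dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] += vx[i] * dt;
}
```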
CPU ARM
NEON
-
Vector width: 128-bit registers
-
Typical lane count: 4 lanes × 32-bit (e.g., 4 × float32)
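-
A minimal NEON sketch (illustrative; assumes n is a multiple of 4) showing the 4 × float32 lanes in practice:
```cpp
#include <arm_neon.h>

// Adds two float arrays four lanes at a time with NEON intrinsics.
void add_f32_neon(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);   // load 4 floats
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vaddq_f32(va, vb);  // 4 adds in one instruction
        vst1q_f32(out + i, vc);              // store 4 floats
    }
}
```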
SVE / SVE2 (Scalable Vector Extension)
-
Vector width: Variable.
-
Register width is not fixed (128–2048 bits in 128-bit steps).
-
Code is vector-length agnostic (designed to scale across hardware with different vector widths).
-
-
Typical lane count: implementation-defined; e.g., 8 lanes × 32-bit (8 × float32) on a 256-bit implementation
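-
A hedged SVE sketch of vector-length-agnostic code (ACLE intrinsics; names assume a recent arm_sve.h toolchain): the loop never hardcodes the register width, it asks the hardware via svcntw() and masks the tail with a predicate.
```cpp
#include <arm_sve.h>
#include <cstdint>

// The same binary runs on 128-bit or 512-bit SVE hardware because
// svcntw() (number of 32-bit lanes) and the predicate adapt at runtime.
void add_f32_sve(const float* a, const float* b, float* out, uint64_t n) {
    for (uint64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_u64(i, n);     // mask off the tail
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, out + i, svadd_f32_x(pg, va, vb));
    }
}
```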
CPU x86/x64 (Intel / AMD)
FMA3 / FMA4 (Fused multiply-add)
-
Is often used in combination with AVX/AVX2/AVX-512 (FMA3 is the widely supported variant; FMA4 was AMD-only and has since been dropped).
-
Platform: x86/x64 (Intel & AMD)
-
Vector width: 256-bit registers
-
Typical lane count: 8 lanes × 32-bit (8 × float32)
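-
A small FMA3 sketch (illustrative; assumes n is a multiple of 8): one fused instruction computes a*b + c across 8 float lanes, with a single rounding step.
```cpp
#include <immintrin.h>

// a*b + c over 8 floats per iteration (FMA3, available on AVX2-class CPUs).
void fmadd_f32_fma(const float* a, const float* b, const float* c,
                   float* out, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(out + i, _mm256_fmadd_ps(va, vb, vc));  // va*vb + vc
    }
}
```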
AVX-512
-
Adoption is limited (mainly HPC, data centers, and select desktop/server chips).
-
Includes masking, scatter/gather, and more advanced operations.
-
Platform: x86/x64 (select Intel CPUs and AMD Zen 4 and later; not universally available)
-
Vector width: 512-bit registers
-
Typical lane count: 16 lanes × 32-bit (16 × float32)
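-
A sketch of AVX-512 masking (my illustration): the loop tail is handled with a mask register instead of a scalar cleanup loop, so inactive lanes are simply not loaded or stored.
```cpp
#include <immintrin.h>

void add_f32_avx512(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 16) {
        // One mask bit per valid element in this iteration.
        __mmask16 m = (n - i >= 16) ? 0xFFFF
                                    : (__mmask16)((1u << (n - i)) - 1);
        __m512 va = _mm512_maskz_loadu_ps(m, a + i);  // masked-off lanes read as 0
        __m512 vb = _mm512_maskz_loadu_ps(m, b + i);
        _mm512_mask_storeu_ps(out + i, m, _mm512_add_ps(va, vb));
    }
}
```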
AVX / AVX2 (Advanced Vector Extensions)
-
AVX2 added full 256-bit integer support.
-
Platform: x86/x64 (Intel & AMD)
-
Vector width: 256-bit registers
-
Typical lane count: 8 lanes × 32-bit (8 × float32)
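-
An AVX2 sketch showing the 256-bit integer support (illustrative; assumes n is a multiple of 8):
```cpp
#include <immintrin.h>

// 8 x int32 adds per instruction, the integer counterpart to 8 x float32.
void add_i32_avx2(const int* a, const int* b, int* out, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                            _mm256_add_epi32(va, vb));
    }
}
```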
SSE (Streaming SIMD Extensions)
-
Superseded by AVX.
-
SSE1–SSE4 progressively added instructions but retained 128-bit width.
-
Platform: x86/x64 (Intel & AMD)
-
Vector width: 128-bit registers
-
Typical lane count: 4 lanes × 32-bit (4 × float32)
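-
The classic 128-bit SSE pattern, as a sketch (assumes n is a multiple of 4):
```cpp
#include <xmmintrin.h>

// 4 floats per instruction with 128-bit XMM registers.
void add_f32_sse(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
}
```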
MMX
-
Legacy, obsolete.
-
Platform: x86/x64 (Intel & AMD)
-
Vector width: 64-bit registers
-
Typical lane count: 2 lanes × 32-bit (2 × int32; MMX is integer-only, with no float32 support)
RISC-V (pronounced "Risk-Five")
-
Is an open, modular instruction set architecture (ISA) based on the RISC (Reduced Instruction Set Computer) design principles.
-
Unlike proprietary ISAs (e.g., x86 by Intel/AMD, ARM by Arm Ltd.), RISC-V is:
-
Open source: anyone can use or implement it without licensing fees.
-
Modular: it has a minimal base instruction set, with optional extensions (e.g., floating-point, SIMD, vector).
-
RVV (RISC-V Vector Extension)
-
Similar to ARM SVE, RVV allows hardware to define vector width.
-
Not fixed to 128, 256, or 512 bits; code adapts dynamically.
-
Scalable width: Vector registers can be from 128 to 2048 bits, depending on hardware.
-
Vector-Length Agnostic (VLA):
-
Programs don't assume a fixed vector width.
-
Code adapts to hardware at runtime; the same binary works on 128-bit or 512-bit hardware.
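-
A hedged RVV sketch (ratified RVV 1.0 intrinsics; older toolchains may omit the __riscv_ prefix): vsetvl returns how many elements this hardware can process per iteration, so the same source is vector-length agnostic.
```cpp
#include <riscv_vector.h>
#include <cstddef>

// Vector-length-agnostic add: the loop asks the hardware for the
// element count each pass instead of hardcoding the register width.
void add_f32_rvv(const float* a, const float* b, float* out, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);            // elements this pass
        vfloat32m1_t va = __riscv_vle32_v_f32m1(a + i, vl);
        vfloat32m1_t vb = __riscv_vle32_v_f32m1(b + i, vl);
        __riscv_vse32_v_f32m1(out + i, __riscv_vfadd_vv_f32m1(va, vb, vl), vl);
        i += vl;
    }
}
```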
-
GPU
-
GPUs use SIMT (Single Instruction, Multiple Thread), not SIMD per se, but functionally similar at scale.
CUDA
-
NVIDIA GPUs
-
Vector width: Scalar SIMT
-
Typical lane count: 32 threads per warp, each operating on 32-bit scalars (effectively 32 × float32)
OpenCL
-
Cross-vendor GPU compute
-
Vector width: Variable
-
Typical lane count: device-dependent; e.g., 8 lanes × 32-bit (8 × float32) when using float8 vector types
Wavefronts / Warps
-
Used in GPU shaders
-
Vector width: 32/64 threads
-
Typical lane count: 32 threads (NVIDIA warp) or 64 threads (AMD wavefront) × 32-bit